Abstract

An instance of a group testing problem is a set of objects O 
and an unknown subset P of O. The task is to determine P 
by using queries of the type "does P intersect Q" , where Q is 
a subset of O. This problem occurs in areas such as fault de- 
tection, multiaccess communications, optimal search, blood 
testing and chromosome mapping. Consider the two stage 
algorithm for solving a group testing problem. In the first 
stage a predetermined set of queries are asked in parallel 
and in the second stage, P is determined by testing individual
objects. Let $n = |O|$. Suppose that P is generated by independently
adding each $x \in O$ to P with probability $p/n$. Let $q_1$ ($q_2$)
be the number of queries asked in the first (second) stage of this
algorithm. We show that if $q_1 = o(\log(n)\log(n)/\log\log(n))$, then
$\mathrm{Exp}(q_2) = n^{1-o(1)}$, while there exist algorithms with
$q_1 = O(\log(n)\log(n)/\log\log(n))$ and $\mathrm{Exp}(q_2) = o(1)$.
The proof involves a relaxation technique which can be used with
arbitrary distributions. The best previously known bound is
$q_1 + \mathrm{Exp}(q_2) = \Omega(p\log(n))$. For general group testing
algorithms, our results imply that if the average number of queries
over the course of $n^{\gamma}$ ($\gamma > 0$) independent experiments
is $O(n^{1-\epsilon})$, then with high probability
$\Omega(\log(n)\log(n)/\log\log(n))$ non-singleton subsets are
queried. This settles a conjecture of Bill Bruno and David 
Torney and has important consequences for the use of group 
testing in screening DNA libraries and other applications 
where it is more cost effective to use non-adaptive algorithms 
and/or too expensive to prepare a subset Q for its first test. 

1 Introduction 

An instance of a group testing problem is a set of
objects O and an unknown subset P of O. The task
is to determine P by using queries of the type "does
P intersect Q", where Q is an arbitrary subset (pool)
of O. An element $x \in O$ is positive if $x \in P$,
negative otherwise [footnote 1]. A pool is said to be positive if
one of its objects is positive, negative otherwise. The
determination of whether P intersects Q is called a test
of Q. See Section 2 for a brief overview of the history
and applications of group testing.

Footnote 1. Positive elements are usually referred to as defectives in the
literature. This choice of terminology is unfortunate, as in most
applications of group testing today, no defect is implied.

Algorithms for solving group testing problems can 
be classified by the degree to which they are adaptive. 
General adaptive algorithms can be modelled by an 
arbitrary binary decision tree, where each node corre- 
sponds to a pool Q, and which child to consider next 
is determined by the outcome of the test of Q. Com- 
pletely non-adaptive algorithms (also called one-stage 
algorithms) are defined by a set of pools Q, where all 
the pools in Q are tested in parallel and the set P has to 
be determined from the outcome of the tests. A nearly 
non-adaptive algorithm that is of great interest for many 
screening problems is the trivial two-stage algorithm. 
Such an algorithm proceeds in two stages. In the first 
stage, the members of a fixed set of pools are tested in 
parallel and in the second stage only individual objects 
are tested. Which individual objects are tested may 
depend on the outcome of the first stage. 
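
The trivial two-stage algorithm described above can be sketched in a few lines. This is only an illustration under stated assumptions: the pool design (each object joins each pool with probability 0.2), the parameter values, and the function names `two_stage` and `bernoulli_positives` are hypothetical and are not taken from the paper. In this sketch every candidate is tested individually in the second stage.

```python
import random


def two_stage(n, pools, positives):
    """Run the trivial two-stage algorithm on objects 0..n-1.

    Returns the recovered positive set and the number of individual
    (second-stage) tests. Every candidate is tested individually.
    """
    # Stage 1: all pools are tested in parallel; a pool is positive
    # exactly when it intersects the set of positives.
    outcome = [bool(pool & positives) for pool in pools]
    # An object remains a candidate only if no pool containing it is negative.
    candidates = {
        x for x in range(n)
        if all(outcome[j] for j, pool in enumerate(pools) if x in pool)
    }
    # Stage 2: individual tests on the candidates only.
    found = {x for x in candidates if x in positives}
    return found, len(candidates)


def bernoulli_positives(n, p, rng):
    """Add each object to P independently with probability p/n."""
    return {x for x in range(n) if rng.random() < p / n}


rng = random.Random(0)
n, v, p = 200, 40, 2.0
# Hypothetical pool design: each object joins each pool with probability 0.2.
pools = [frozenset(x for x in range(n) if rng.random() < 0.2) for _ in range(v)]
positives = bernoulli_positives(n, p, rng)
found, stage2_tests = two_stage(n, pools, positives)
print(found == positives, stage2_tests)
```

The second stage never misses a positive, because every pool containing a positive object is itself positive; the cost of the design shows up only in how many negative objects survive as candidates.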

Research in the theory of group testing tradition- 
ally falls into two categories: probabilistic group testing 
and combinatorial group testing. In probabilistic group 
testing, a probabilistic model for the occurrence of pos- 
itives is assumed, and the group testing procedure is 
optimized for minimum expected cost subject to constraints.
In combinatorial group testing, it is assumed
that the set of positives can be any member of a given
family of sets $\mathcal{F}$, and the task is to find the algorithm
which requires the minimum number of tests to uniquely
determine $P \in \mathcal{F}$ in the worst case. Combinatorial
group testing is covered in detail in [6].

Here we resolve some questions about optimal algorithms
for probabilistic group testing. Let the set of positive
objects be distributed according to $\mathrm{Prob}(P)$, the
probability that P is the set of positives. The main contribution
of this work is the development of a technique
for obtaining lower bounds on the tradeoff between the
number of pools $v = |Q|$ and the expected number $d$
of queries that must be asked in the second stage. A
fundamental result obtained by using this technique is
as follows.

Theorem 1.1. Consider the group testing problem
with n objects and v first-stage pools, where the set of
positive objects P is a uniformly randomly chosen p-tuple.
Then the expected number $\bar{p}$ of objects which occur
only in positive pools is at least $(n/2^{v/p})(1 - 4/p) - p$.

This is a restatement of Theorem 3.2 which is proved
in Section 3.3. Note that if every $P \subseteq O$ has even a
very small chance of being the set of positive objects,
then $\bar{p} - p$ is a lower bound on the expected number of
individual queries that must be asked in a trivial second
stage to completely determine the positive objects.

We use Theorem 1.1 to obtain two results of practi- 
cal interest. The first result applies to trivial two-stage 
group testing algorithms. Let the distribution of P be 
determined by randomly and independently adding each 
object to P with probability p/n (p constant). We say 
that P is Bernoulli with parameter p. 

Theorem 1.2. If $v \le \beta^2 \ln(n)\log_2(n)/\ln\ln(n)$,
then $d \ge n^{1-2\beta-o(1)}$. There are two-stage algorithms
with $v = e^{1+\epsilon}\ln(n)\ln(n)/\ln\ln(n)\,(1 + o(1))$ and
$d = O(n^{-\epsilon/2})$.

Theorem 1.2 hints at a strong threshold behavior of d
given v.

The lower bounds obtained by our technique are
substantially stronger than previously known ones. In
the case where $\mathrm{Exp}(|P|) = o(n)$, the best known
bounds have been obtained by information theoretic
arguments. The expected number of queries is at least
$I(P) = -\sum_{P \subseteq O} \mathrm{Prob}(P)\log_2(\mathrm{Prob}(P))$, on average. For
the case where P is Bernoulli with parameter p,
$I(P) = p\log_2(n)(1 + o(1))$. A notable feature of Theorem 1.2 is
that the bounds are independent of p.

General algorithms for group testing can achieve the
information theoretic bound to within a constant factor
in the unit cost per query model. The simple bisection
strategy finds p positives in at most $2p\lceil\log_2(n)\rceil$ tests.
If the distribution of P is uniform over all p-tuples of
O and $p = o(n)$, then the two-stage adaptive strategy
also achieves this bound to within a constant factor
([4], Section 4). Theorem 1.2 shows that in general the
two-stage strategy does not achieve this bound, even
if $I(P) = O(\log_2(n))$. Intuitively, the reason for this is
that the design of the pools must accommodate numbers
of positives which are much larger than average, even
though the probability of such an event is very small.
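
For comparison with the two-stage setting, a fully adaptive bisection-style search can be sketched as follows. This is one simple recursive-halving variant, written as an assumption about what such a strategy looks like; the exact strategy and constants behind the test count quoted above may differ. The name `find_positives_by_splitting` and the query callback are hypothetical.

```python
def find_positives_by_splitting(n, pool_is_positive):
    """Identify all positives among objects 0..n-1 by recursive halving.

    pool_is_positive(pool) answers the query "does P intersect this pool".
    Returns (sorted positives, number of queries asked).
    """
    queries = 0
    found = []

    def query(lo, hi):
        nonlocal queries
        queries += 1
        return pool_is_positive(range(lo, hi))

    def search(lo, hi):
        # Invariant: the pool {lo, ..., hi-1} is already known to be positive.
        if hi - lo == 1:
            found.append(lo)
            return
        mid = (lo + hi) // 2
        left = query(lo, mid)
        if left:
            search(lo, mid)
        # If the left half is negative the right half must be positive,
        # so its query can be skipped.
        if not left or query(mid, hi):
            search(mid, hi)

    if n > 0 and query(0, n):
        search(0, n)
    return sorted(found), queries


positives = {5}
result, queries = find_positives_by_splitting(
    16, lambda pool: any(x in positives for x in pool)
)
print(result, queries)
```

Each query here depends on earlier outcomes, which is exactly the kind of adaptivity that the two-stage model rules out.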

The second result derived from Theorem 1.1 applies
to arbitrary adaptive algorithms. The lower
bound of Theorem 1.2 implies that in $n^{\gamma}$ independent
determinations of P for the same set of objects,
with high probability either the average number of
individual object queries is $\Omega(n^{1-\epsilon})$ ($0 < \epsilon < 1$)
or the number of non-singleton pools constructed is
$\Omega(\log(n)\log(n)/\log\log(n))$. The details are given in
Theorem 5.1.



2 Overview of group testing applications and 
significance of results 

Group testing has been a much researched topic since 
the problem was formally published as a potential 
approach for economical blood testing [5]. In the 
blood testing problem, the task is to efficiently find 
the few samples which are positive for a disease such 
as syphilis by pooling samples and testing the pools. 
The basic idea is that if a pool is found to be negative, 
then all the samples which contributed to this pool 
can be excluded and do not have to be individually 
tested. This idea has since been used for quality 
control in product testing (when multiple items can 
be tested simultaneously) [20], searching files in punch 
card storage systems [15], efficient access of magnetic 
core memories [15], sequential screening of experimental 
variables [16], efficient algorithms for multiple-access 
systems and communication [17], and unique sequence 
screening of clone libraries [3, 4]. It has also been 
used in coding theory, optimal search, and the design 
of algorithms. 

Adaptive and non-adaptive group testing algo- 
rithms seem to have been discovered and discussed in- 
dependently. The first published paper on group test- 
ing [5] discusses a simple adaptive algorithm for prob- 
abilistic group testing. This paper gave rise to many 
studies of adaptive group testing algorithms both in 
probabilistic and combinatorial contexts (for example 
[20, 19, 18]). Non-adaptive group testing methods were 
discovered somewhat later in the context of efficient 
searching of punch card files and accessing magnetic 
core memories (see [15] and the references therein). 
The first methods used for these applications were randomized
and used primarily in a probabilistic context.
Kautz and Singleton [15] first used combinatorial
methods from coding theory to obtain combinatorial
non-adaptive group testing algorithms. Kautz
and Singleton's work was continued by Dyachov and 
others in the Soviet Union [7, 8, 9, 10, 11] and else- 
where [13, 14, 21, 17, 12, 3, 1, 4]. 

The results of this paper have immediate implica- 
tions for applications of group testing where algorithms 
with more than two stages are undesirable and/or 
pools can be reused but the initial construction of non- 
singleton pools is expensive compared to testing. In 
most previous studies of group testing, the cost model 
assumes unit cost per query and does not consider pool 
construction independently from pool testing or testing 
of individual objects. This is appropriate for problems 
where pools cannot be reused or are cheap to build. It 
does not address the actual costs encountered in screen- 
ing clone or protein libraries, which are possibly the 
most active current applications of group testing. 
Nearly every laboratory involved in the mapping of
chromosomes uses group testing for library screening. In 
this application, the chromosome or genome of interest 
(basically a sequence of DNA) is randomly cut into 
many overlapping pieces of similar sizes. These pieces 
are replicated in clones and stored in large genomic 
libraries (1000 to 75000 clones, there has been discussion 
of libraries with up to 10® clones). The first task is 
to determine the arrangement of these clones on the 
original sequence of DNA. One of the methods for doing 
this is to obtain for each of a large set of unique sites 
the set of clones which contain this site. Once this is 
done, the sites and clones are ordered and localized [footnote 2].
One would then like to use the clone libraries to find 
genes and other features of interest. This also requires 
screening the clones. 

There are two features which distinguish the library 
screening problem from many other applications of 
group testing. The first is that the same library will 
be tested for many sites. In fact, the number of sites 
that must be tested before the sequence can be reliably 
reconstructed is usually of the order of the number 
of clones in the library. This is why Theorem 5.1 is 
relevant. The second feature is that pool construction is 
very costly. It is generally feasible to construct a number 
of pools (much fewer than the number of clones) initially 
by exploiting parallelism, but adaptive construction of 
pools with many clones during the testing procedure 
is discouraged. The technicians who implement the 
pooling strategies generally dislike even the 3-stage 
strategies that are often used. Thus the most commonly 
used strategies for pooling libraries of clones rely on a 
fixed but reasonably small set of non-singleton pools. 
The pools are either tested all at once or in a small 
number of stages (usually at most 2) where the previous 
stage determines which pools to test in the next stage. 
The potential positives are then inferred and confirmed 
by testing of individual clones. In most biological 
applications each positive clone must be confirmed even 
if the pool results unambiguously indicate that it is 
positive. This is to improve the confidence in the results, 
given that in practice the tests are prone to errors. 

The first formal study of how to best screen libraries 
of clones is due to Barillot et al. [3]. They showed that 
fairly efficient trivial two stage strategies can be ob- 
tained by using simple geometric constructions for the 
pools. It has since been realized [2] that these construc- 
tions correspond to simple error correcting codes and 
were already discussed in [15]. In [4] randomized constructions
are shown to be very useful for library screening.
These types of randomized constructions were originally
used for the file searching application [15] and
were thoroughly analyzed by Dyachov [8] in a combinatorial
group testing context.

For screening libraries of clones, the distribution of 
P, the set of positive clones, is well approximated by the 
Bernoulli distribution, at least for the first $O(n)$ screenings.
Recently computational studies of W.J. Bruno
(unpublished) showed that for the Bernoulli distribution 
the number of pools required for successfully obtaining 
the positive clones with few individual tests appeared to 
grow substantially faster for randomized constructions 
than the information theoretic lower bound. He conjec- 
tured that such a growth was necessary for the Bernoulli 
distribution. The results of this paper imply that this 
conjecture is true. 

The results given here still hold provided that 
the distribution of positives has a sufficiently large 
probability of $\Omega(\log(n)/\log\log(n))$ positives (see the
proof of Theorem 3.3). This means that the effects 
of the bound will be observable provided that the 
distribution is close to a Bernoulli distribution, as 
is the case for the initial screenings of a library of 
clones. This is due to the random nature of the 
construction of clone libraries. In a fixed library, 
the information obtained from previous screenings will 
eventually constrain the possible results. However, 
unless the ordering of the tested sites is known in 
advance, the number of screenings required to observe 
this is $\Omega(n)$. Other group testing applications such as
blood testing or affinity testing for proteins may show 
fewer dependencies between screenings. It is therefore 
likely that the tradeoffs of Theorems 1.2 and 5.1 can be 
observed in practice. Provided that it is indeed desirable 
to completely determine the set of positive objects, the 
number of non-singleton pools that must be constructed 
is substantially larger than required by the information 
theoretic bound. 

The remainder of this work is organized as follows: 
Section 3 describes a general technique for obtaining 
lower bounds for non-adaptive group testing strategies. 
The lower bound of Theorem 1.2 follows immediately 
from Theorem 3.4 which is proved at the end of this 
section. In Section 4 we use the probabilistic method 
to obtain the upper bound of Theorem 1.2. This bound 
is a consequence of Theorem 4.1. In Section 5 it is 
shown how Theorem 1.2 can be used to obtain bounds 
on the minimum number of non-singleton pools that are 
constructed by any algorithm. Some open problems and 
directions for future work are given in Section 6. 



Footnote 2. How to obtain the reconstruction from this (usually imperfect)
data is itself the topic of intensive research in approximation
algorithms.
3 Lower bound methods

The technique for obtaining lower bounds on the trade- 
off between the number of pools and the number of 
second stage queries in two-stage algorithms relies on 
a combinatorial relaxation technique which transforms 
the problem to a linear program. The initial linear pro- 
gram is simplified for symmetric distributions so that in 
principle it could be solved exactly. We estimate its op- 
timum value by going to the dual and applying a simple 
greedy method to obtain good bounds. 

3.1 Relaxation to a linear program. Consider the
general problem of constructing informative pools. Let
Q be a set of v pools. Write $S = 2^v$. Any function
$g : O \to 2^{Q}$ determines a way of pooling the objects by
adding each object x to the pools in g(x). If P is the
set of positive objects, then the set of positive pools is
given by

\[ \bigcup_{x \in P} g(x) = \{ y \mid \exists x \in P \text{ such that } y \in g(x) \}. \]

Let $\mathrm{Prob}(P)$ be the probability that $P \subseteq O$ is the set
of positive objects.

Suppose all the pools of Q are tested and the set
of positive pools is $Q_P$. An object x can be positive
only if $g(x) \subseteq Q_P$. Such objects are called candidate
positives. A candidate positive object x with $x \notin P$
is called unresolved negative. Let d be the expected
number of unresolved negative objects. In many cases
of interest, at least the unresolved negative objects must
be examined by any second stage individual testing
method. This occurs in particular if $\mathrm{Prob}(P) > 0$ for
each P, which is satisfied by the Bernoulli distribution.
In general, d is a good estimate of the number of second
stage tests whenever the distribution is sufficiently rich.
Note that in practice, it is often the case that all
candidate positive objects of interest are confirmed
negative or positive to improve the confidence (for some
alternative approaches, see [4]).

Let $\bar{P}$ be the set of candidate positive objects,

\[ \bar{P} = \{ y \in O \mid g(y) \subseteq \bigcup_{x \in P} g(x) \}. \]

The operation $P \mapsto \bar{P}$ is a closure operation which
frequently occurs in the study of union-closed families
of sets and lattices. The expected number of unresolved
negative objects d is computed as

\[ d = \sum_{P \subseteq O} \mathrm{Prob}(P)\,|\bar{P} \setminus P|. \tag{3.1} \]

The goal is a lower bound on d for the given distribution
and number of pools by minimizing d over all functions
g. As it stands, this optimization problem is difficult.
We can however relax the problem to linear programming
by allowing physically impossible pools.
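
For very small instances the closure and the sum (3.1) can be evaluated by brute force. The sketch below is only an illustration: the pooling function `g`, the uniform distribution on single positives, and the helper names `closure` and `expected_unresolved` are assumptions made for the example, not part of the paper.

```python
from fractions import Fraction
from itertools import combinations


def closure(P, g):
    """Candidate positives: objects whose pools all lie in the union of g(x), x in P."""
    positive_pools = set().union(*(g[x] for x in P))
    return {y for y in g if g[y] <= positive_pools}


def expected_unresolved(g, prob):
    """Equation (3.1): sum over all P of Prob(P) * |closure(P) minus P|."""
    objects = sorted(g)
    total = Fraction(0)
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            P = set(subset)
            total += prob(P) * len(closure(P, g) - P)
    return total


# Hypothetical pooling of 4 objects into 3 pools.
g = {0: {0}, 1: {1}, 2: {0, 1}, 3: {2}}
# Assumed distribution: exactly one positive, chosen uniformly.
uniform_single = lambda P: Fraction(1, 4) if len(P) == 1 else Fraction(0)
d = expected_unresolved(g, uniform_single)
print(d)
```

In this toy design only the positive set {2} leaves unresolved negatives (objects 0 and 1), because both of its pools are shared with other objects.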

To obtain the desired linear program $L(n,S)$, we
shift the problem to O. The choice of pools is replaced
first by a choice of a suitable closure operation $P \mapsto \bar{P}$.
A subset of O is closed if it is given by $\bar{P}$ for some
$P \subseteq O$. The closed subsets of O form an intersection-closed
family of sets which contains O. Two important
observations are: (1) The number of closed subsets is
at most S, since every subset U of the pools determines
the closed set $\{x \mid g(x) \subseteq U\}$ and all closed sets can
be obtained like this. (2) For each P, $\bar{P}$ is the unique
minimal closed subset which includes P.

A weak fractional version of the closed sets can
be described by a set of variables $w_V$ and $w_{U,V}$ for
$U \subseteq V \subseteq O$ subject to the constraints

\[ w_V \ge 0, \quad w_{U,V} \ge 0, \tag{3.2} \]
\[ \sum_{V} w_V \le S, \tag{3.3} \]
\[ \text{for each } U: \quad \sum_{V: U \subseteq V} w_{U,V} \ge 1, \tag{3.4} \]
\[ \text{for each } U \subseteq V: \quad w_{U,V} \le w_V. \tag{3.5} \]

The closed sets correspond to the variables $w_V$ and the
cardinality constraint is enforced by inequality (3.3).
How much of each "closed" subset corresponds to the
unique minimal one which includes a given U is now described
by the variables $w_{U,V}$. Inequalities (3.5) ensure
that the amount of V which includes U does not exceed
the degree to which V is "closed". Inequalities (3.4) ensure
that each U is included in a total of at least one
"closed" subset.

Formally, a feasible solution of $L(n,S)$ can be
obtained from a $g : O \to 2^{Q}$ by defining

\[ w_U = [\bar{U} = U], \qquad w_{U,V} = [V = \bar{U}], \]

where for any logical expression $\phi$, $[\phi] = 1$ if $\phi$ is true,
and $[\phi] = 0$ otherwise.

A lower bound on d is now determined by the
minimum value $\mathrm{val}(L(n,S), \mathrm{Prob})$ of

\[ \sum_{U} \mathrm{Prob}(U) \sum_{V: U \subseteq V} w_{U,V}\,|V \setminus U|. \tag{3.6} \]


Given that the number of variables is $2^n + 3^n$, the
problem of evaluating $\mathrm{val}(L(n,S), \mathrm{Prob})$ is impractical
in general. However in many cases of interest, the
probability distribution is symmetric, that is, $\mathrm{Prob}(U)$
depends only on $|U|$. Since $L(n,S)$ itself is symmetric,
the number of variables can be substantially reduced
by assuming that $w_V$ and $w_{U,V}$ depend only on the
cardinalities of U and V. In particular, in the symmetric
case, $L(n,S)$ is equivalent to $L'(n,S)$ with variables $w_j$
and $w_{i,j}$ for $0 \le i \le j \le n$ and constraints

\[ w_j \ge 0, \quad w_{i,j} \ge 0, \tag{3.7} \]
\[ \sum_{j} \binom{n}{j} w_j \le S, \tag{3.8} \]
\[ \text{for each } i: \quad \sum_{j: j \ge i} \binom{n-i}{j-i} w_{i,j} \ge 1, \tag{3.9} \]
\[ \text{for each } i \le j: \quad w_{i,j} \le w_j. \tag{3.10} \]

The quantity to be minimized is

\[ \sum_{i} p(i) \sum_{j: j \ge i} \binom{n-i}{j-i} w_{i,j}\,(j-i), \tag{3.11} \]

where $p(i) = \sum_{|U|=i} \mathrm{Prob}(U)$.

To find useful lower bounds on $\mathrm{val}(L'(n,S), \mathrm{Prob})$,
we can use linear programming duality. The dual
program $L^*(n,S)$ has variables $v$ (for inequality (3.8)),
$v_i$ ($0 \le i \le n$, for inequalities (3.9)) and $v_{i,j}$ ($0 \le i \le j \le n$,
for inequalities (3.10)). The constraints are

\[ v \ge 0, \quad v_i \ge 0, \quad v_{i,j} \ge 0, \tag{3.12} \]
\[ \text{for each } j: \quad \sum_{i: i \le j} v_{i,j} - \binom{n}{j} v \le 0, \tag{3.13} \]
\[ \text{for each } i \le j: \tag{3.14} \]
\[ \binom{n-i}{j-i} v_i - v_{i,j} \le p(i) \binom{n-i}{j-i} (j-i). \tag{3.15} \]

The value $\mathrm{val}(L^*(n,S), \mathrm{Prob})$ is given by the maximum
value of

\[ f^* = \sum_{i} v_i - S v. \tag{3.16} \]

3.2 A simple feasible solution for $L^*$.
By the duality theorem of linear programming,
$\mathrm{val}(L'(n,S), \mathrm{Prob}) = \mathrm{val}(L^*(n,S), \mathrm{Prob})$ and any feasible
solution yields a lower bound. We use what amounts
to a greedy method to find a reasonably good solution.
The idea is to increase each $v_i$ and adjust the other variables
as long as $f^*$ increases as a result. If one of the
inequalities (3.14) is violated, compensate by increasing
the values of the appropriate $v_{i,j}$. To satisfy inequalities
(3.13) may require increasing $v$. The change of $v_i$
is successful at increasing $f^*$ if the adjustment of $v$ is
sufficiently small.

For each i, let

\[ s(i) = \min\{ s \mid (s)_i \ge (n)_i / S \}, \]

where $(s)_i = s(s-1)\cdots(s-i+1)$ is the i'th falling
factorial of s.

Theorem 3.1.

\[ \mathrm{val}(L'(n,S), \mathrm{Prob}) \ge \sum_{i} p(i)(s(i) - i) - S \max_{j} \sum_{i: i \le j \le s(i)} \frac{(j)_i}{(n)_i}\, p(i)(s(i) - j). \]

Proof. By following a greedy strategy of finding a
good feasible solution, one obtains

\[ v_i = p(i)(s(i) - i). \]

To extend this to a feasible solution of $L^*(n,S)$, let

\[ v_{i,j} = \binom{n-i}{j-i} p(i)(s(i) - j)\,[i \le j \le s(i)]. \]

This ensures that inequalities (3.14) are satisfied. To
satisfy inequalities (3.13) requires that for each j,

\[ \binom{n}{j} v \ge \sum_{i: i \le j} v_{i,j} = \sum_{i: i \le j} [j \le s(i)] \binom{n-i}{j-i} p(i)(s(i) - j). \]

Thus we can let

\[ v = \max_{j} \sum_{i: i \le j \le s(i)} \frac{(j)_i}{(n)_i}\, p(i)(s(i) - j). \]

For $f^*$ we now get

\[ f^* = \sum_{i} p(i)(s(i) - i) - S \max_{j} \sum_{i: i \le j \le s(i)} \frac{(j)_i}{(n)_i}\, p(i)(s(i) - j). \tag{3.17} \]

The right-hand side of the inequality in Theorem 3.1
can be reasonably estimated for any symmetric distribution
and gives a lower bound for the optimum value
of d given S.

3.3 Evaluation of the bound for the Bernoulli
distribution. For the distributions of interest here, we
can obtain a lower bound on d by conditioning on the
case where the number of positives is fixed. Thus we
estimate d by considering the distribution $p_k(i) = [i = k]$.
Let $b_k(n,S) = \mathrm{val}(L^*(n,S), p_k)$. The minimum
expected number of unresolved negatives can then be
estimated by $p(k)\,b_k(n,S)$.
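
The greedy dual solution from the proof of Theorem 3.1 can be checked numerically on a small instance. The sketch below is an illustration only: the instance (n = 12, S = 2^4, Bernoulli parameter 2) and the helper names `falling`, `s_of`, and `greedy_dual` are assumptions chosen for the example. Exact rational arithmetic is used so that feasibility of (3.13) and (3.15) can be tested without rounding.

```python
from fractions import Fraction
from math import comb


def falling(s, i):
    """Falling factorial (s)_i = s (s-1) ... (s-i+1)."""
    out = 1
    for t in range(i):
        out *= s - t
    return out


def s_of(i, n, S):
    """s(i) = min{ s : (s)_i >= (n)_i / S }."""
    s = 0
    while falling(s, i) * S < falling(n, i):
        s += 1
    return s


def greedy_dual(n, S, p):
    """Greedy feasible solution of the dual program; p is the list p(0..n)."""
    s = [s_of(i, n, S) for i in range(n + 1)]
    v_i = [p[i] * (s[i] - i) for i in range(n + 1)]
    v_ij = {
        (i, j): comb(n - i, j - i) * p[i] * (s[i] - j) if j <= s[i] else Fraction(0)
        for i in range(n + 1)
        for j in range(i, n + 1)
    }
    # Smallest v that satisfies (3.13) for every j.
    v = max(
        sum(v_ij[i, j] for i in range(j + 1)) / comb(n, j)
        for j in range(n + 1)
    )
    f_star = sum(v_i) - S * v
    return s, v_i, v_ij, v, f_star


n, S = 12, 2 ** 4
r = Fraction(2, n)  # assumed Bernoulli parameter p = 2
p = [comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(n + 1)]
s, v_i, v_ij, v, f_star = greedy_dual(n, S, p)
print(float(f_star))
```

The value `f_star` is a lower bound on d for this instance by weak duality; it may be far from tight, and for small instances it can even be negative.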
Lemma 3.1. For all n, S and i, $s(i) \ge n/S^{1/i}$.

Proof. We have $(s(i))_i \ge (n)_i/S$. Since $s(i) \le n$,
$(s(i))_i/(n)_i \le (s(i)/n)^i$ and the result follows.

Theorem 3.2.

\[ b_k(n,S) \ge s(k)(1 - 4/k) - k. \]

Proof. We can assume without loss of generality
that $s(k) > k$ and $k \ge 2$. We have

\[ b_k(n,S) \ge s(k) - k - S \max_{j: k \le j \le s(k)} \frac{(j)_k}{(n)_k}\,(s(k) - j). \]

Observe that by the definition of $s(k)$, $S/(n)_k < 1/(s(k) - 1)_k$. Hence

\[ b_k(n,S) \ge s(k) - k - \frac{1}{(s(k) - 1)_k} \max_{j: k \le j \le s(k)} (j)_k (s(k) - j). \]

Write $s = s(k)$ and let $f(j) = (j)_k (s - j)$. The maximum
of $f(j)$ occurs at a j such that $f(j)/f(j-1) \ge 1$ and
$f(j+1)/f(j) \le 1$. Since $f(j)/f(j-1) = (s-j)j/((s-j+1)(j-k))$,
the following are equivalent:

\[ f(j)/f(j-1) \ge 1 \]
\[ sj - j^2 \ge -(s+1)k + (k+s+1)j - j^2 \]
\[ (s+1)k \ge (k+1)j \]
\[ (s+1)k/(k+1) \ge j. \]

This implies that the maximum occurs at
$j_m = \min(s - 1, \lfloor (s+1)k/(k+1) \rfloor)$.

\[ f(j_m)/(s-1)_k \le \left( \frac{(s+1)(k/(k+1))}{s-1} \right)^k (s - j_m) \]
\[ \le \left( \frac{s+1}{s-1} \right)^k \frac{1}{(1+1/k)^k} \cdot \frac{s+1}{k+1} \]
\[ \le \left( \frac{1+1/k}{1-1/k} \right)^k \frac{1}{(1+1/k)^k} \cdot \frac{s}{k} \]
\[ \le 4(s/k), \]

where we used $s > k \ge 2$ and the fact that $(1 - 1/k)^k$ is
increasing in k. The proof of the theorem is completed
by substituting this inequality for the last summand in
the lower bound for $b_k(n,S)$.
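
The inequality chain in the proof of Theorem 3.2 can be spot-checked numerically. The sketch below evaluates the right-hand side of Theorem 3.1 for the fixed-size distribution $p_k$ and compares it with $s(k)(1 - 4/k) - k$. The parameter grid and the helper names are hypothetical; a passing check on a few values illustrates the bound and is not a substitute for the proof.

```python
from fractions import Fraction


def falling(s, i):
    """Falling factorial (s)_i = s (s-1) ... (s-i+1)."""
    out = 1
    for t in range(i):
        out *= s - t
    return out


def s_of(i, n, S):
    """s(i) = min{ s : (s)_i >= (n)_i / S }."""
    s = 0
    while falling(s, i) * S < falling(n, i):
        s += 1
    return s


def fixed_k_bound(n, S, k):
    """Theorem 3.1 right-hand side for p_k(i) = [i = k]; returns (s(k), bound)."""
    s = s_of(k, n, S)
    worst = max(
        Fraction(falling(j, k) * (s - j), falling(n, k))
        for j in range(k, s + 1)
    )
    return s, s - k - S * worst


checks = []
for n in (50, 200):
    for v in (4, 8, 16):
        for k in range(2, 12):
            s, lower = fixed_k_bound(n, 2 ** v, k)
            # Theorem 3.2 claims: lower >= s(k) (1 - 4/k) - k.
            checks.append(lower >= s * (1 - Fraction(4, k)) - k)
print(all(checks))
```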

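Theorem 3.3. Let $\alpha > \beta > 0$, $k = \alpha f(n)(1 + o(1))$
and $v \le \beta \log_2(n) f(n)$, with $f(n) = o(n^{\epsilon})$ for some
$0 < \epsilon < 1 - \beta/\alpha$ and $1 = o(f(n))$. Then

\[ b_k(n,S) \ge n^{1 - \beta/\alpha - o(1)}. \]

Proof. Applying the previous results gives

\[ b_k(n,S) \ge s(k)(1 - o(1)) - k \ge (n/S^{1/k})(1 - o(1)) - k = n^{1 - \beta/\alpha - o(1)}. \]
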

For the rest of this section, let $p(i)$ be determined
by the Bernoulli distribution with parameter p, so that
$p(i) = \binom{n}{i} (p/n)^i (1 - p/n)^{n-i}$. The following theorem implies
the first half of Theorem 1.2.

Theorem 3.4. Let $0 < \beta < 1/2$. For
$v \le \beta^2 \ln(n)\log_2(n)/\ln\ln(n)$,

\[ d \ge n^{1 - 2\beta - o(1)}. \]

Proof. We can estimate d by $p(k)\,b_k(n,S)$ with
$k = \alpha \ln(n)/\ln\ln(n)\,(1 + o(1))$. The probability $p(k)$
is bounded as follows:

\[ p(k) = \binom{n}{k} \left(\frac{p}{n}\right)^k \left(1 - \frac{p}{n}\right)^{n-k} = e^{-k\ln(k)(1 + o(1))} = n^{-\alpha(1 + o(1))}. \]

Using Theorem 3.3, this gives

\[ d \ge p(k)\,b_k(n,S) \ge n^{-\alpha + 1 - \beta^2/\alpha - o(1)}. \]

Let $\alpha = \beta$. Then $d \ge n^{1 - 2\beta - o(1)}$.

It is clear that a result such as Theorem 3.4 holds
for any distribution p which satisfies $p(k) \ge n^{-\alpha - o(1)}$ for
suitable k. In fact, any k for which this holds implies a
tradeoff similar to that in Theorem 1.2.

4 Probabilistic construction of pools 

The probabilistic method can be used to show that the 
pools can be chosen in a nearly optimal fashion. Let 
V be the set of v pools. In this section we specify the 
relationships between the objects and the pools by an 
$n \times v$ incidence matrix I, where $I_{x,y} = 1$ if object x
occurs in pool y and $I_{x,y} = 0$ otherwise. Consider I
as a random variable with distribution determined by
randomly and independently setting each $I_{x,y}$ to 1 with
probability q. This and related models are frequently
used for probabilistic constructions involving general 
families of sets and have been applied to one-stage non- 
adaptive group testing algorithms [7, 8, 9, 12]. 

Let d(I) be the expected number of unresolved 
negatives for the pools constructed according to I if the 
probability of exactly i positives is p(i). The following 
calculations can be done for more general distributions 
of the positive objects to obtain similar results. 

Lemma 4.1.

\[ d(I) = \sum_{i=0}^{n} p(i)(n - i)(1 - q(1-q)^i)^v. \]

Proof. Assume that there are i positive objects
$x_1, \ldots, x_i$, and x is a negative object. For x to be
unresolved negative requires that for every pool y the
following is not the case: x is in y (probability q) and
for each $x_j$, $x_j$ is not in y (probability $(1-q)^i$). Thus
the probability that x is unresolved negative is given by
$(1 - q(1-q)^i)^v$. Given that there are i positive objects,
there are $n - i$ potential unresolved negative objects.
Using linearity of expectations yields the sum in the
lemma.
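
Lemma 4.1 can be verified exactly on a tiny instance by enumerating every incidence matrix. The sketch below assumes exactly k positives (so $p(i) = [i = k]$), fixes them to be objects 0..k-1 by symmetry, and uses the hypothetical values n = 3, v = 2, q = 1/3; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import product


def lemma_4_1(n, v, q, p_of):
    """Closed form: sum_i p(i) (n - i) (1 - q (1 - q)^i)^v."""
    return sum(
        p_of(i) * (n - i) * (1 - q * (1 - q) ** i) ** v
        for i in range(n + 1)
    )


def exact_expectation(n, v, q, k):
    """Average number of unresolved negatives over all n x v 0/1 matrices,
    with the positives fixed to objects 0..k-1."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=n * v):
        I = [bits[x * v:(x + 1) * v] for x in range(n)]
        ones = sum(bits)
        weight = q ** ones * (1 - q) ** (n * v - ones)
        pool_positive = [any(I[x][y] for x in range(k)) for y in range(v)]
        # A negative object is unresolved if every pool containing it is positive.
        unresolved = sum(
            all(pool_positive[y] for y in range(v) if I[x][y])
            for x in range(k, n)
        )
        total += weight * unresolved
    return total


q = Fraction(1, 3)
exact = exact_expectation(3, 2, q, 1)
formula = lemma_4_1(3, 2, q, lambda i: int(i == 1))
print(exact == formula)
```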

Consider now the case where the distribution p(i) 
is determined by each object's being independently 
positive with probability p/n. 

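Lemma 4.2.

\[ d(I) \le n \sum_{i=0}^{n} \frac{p^i}{i!} (1 - q(1-q)^i)^v. \]

Proof.

\[ d(I) \le n \sum_{i=0}^{n} \binom{n}{i} \left(\frac{p}{n}\right)^i \left(1 - \frac{p}{n}\right)^{n-i} (1 - q(1-q)^i)^v \le n \sum_{i=0}^{n} \frac{p^i}{i!} (1 - q(1-q)^i)^v. \]

Theorem 4.1.
Let $\epsilon > 0$, $v = e^{1+\epsilon} \ln(n)\ln(n)/\ln\ln(n)\,(1 + o(1))$ and
$q = \ln\ln(n)/\ln(n)$. Then

\[ d(I) = O(n^{-\epsilon/2}). \]

Proof. Let $k = (1 + \epsilon/2 + \delta)\ln(n)/\ln\ln(n)$ with
$\delta > 0$ constant but sufficiently small as required by the
calculations below. Divide the sum of Lemma 4.2 into
two parts,

\[ S_1 = n \sum_{i: 0 \le i \le k} \frac{p^i}{i!} (1 - q(1-q)^i)^v, \qquad S_2 = n \sum_{i: i > k} \frac{p^i}{i!} (1 - q(1-q)^i)^v. \]

Estimate $S_2$ by

\[ S_2 \le n \sum_{i: i > k} \frac{p^i}{i!} \le n e^{-k\ln(k)(1 - o(1))} = n^{(-\epsilon/2 - \delta)(1 - o(1))}. \]

Bound $S_1$ as follows:

\[ S_1 \le n e^{-vq(1-q)^k} e^{p} = n e^{-\exp(1+\epsilon)\exp(-1-\epsilon/2-\delta)(1 - o(1))\ln(n)} = n^{1 - \exp(\epsilon/2 - \delta)(1 - o(1))}, \]

where we have used the fact that

\[ (1-q)^k = e^{-1-\epsilon/2-\delta}(1 - o(1)). \]

We now have $d(I) \le S_1 + S_2 = O(n^{-\epsilon/2})$.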
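
The quantities in Lemmas 4.1 and 4.2 are easy to evaluate numerically for concrete parameters. The sketch below is only an illustration: the values n = 1000, p = 2, $\epsilon = 1$ are hypothetical, v is rounded up to an integer, and floating point arithmetic is used. It checks the inequality of Lemma 4.2 for these values and makes no claim about the asymptotic rate in Theorem 4.1.

```python
from math import ceil, exp, log


def bernoulli_pmf(n, p):
    """Prob(|P| = i) for i = 0..n, computed by the ratio recurrence."""
    r = p / n
    pmf = [(1 - r) ** n]
    for i in range(n):
        pmf.append(pmf[-1] * (n - i) / (i + 1) * r / (1 - r))
    return pmf


def lemma_4_1_value(n, v, q, p):
    """d(I) from Lemma 4.1 with the Bernoulli distribution p(i)."""
    pmf = bernoulli_pmf(n, p)
    return sum(
        pmf[i] * (n - i) * (1 - q * (1 - q) ** i) ** v
        for i in range(n + 1)
    )


def lemma_4_2_bound(n, v, q, p):
    """Upper bound n * sum_i p^i / i! * (1 - q (1 - q)^i)^v from Lemma 4.2."""
    total, term = 0.0, 1.0  # term holds p^i / i!
    for i in range(n + 1):
        total += term * (1 - q * (1 - q) ** i) ** v
        term *= p / (i + 1)
    return n * total


n, p, eps = 1000, 2.0, 1.0  # hypothetical parameter values
q = log(log(n)) / log(n)
v = ceil(exp(1 + eps) * log(n) ** 2 / log(log(n)))
exact = lemma_4_1_value(n, v, q, p)
bound = lemma_4_2_bound(n, v, q, p)
print(v, exact, bound)
```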

5 Implications for general algorithms 

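To see what Theorem 1.2 implies for general algorithms
applied to independent instances of the Bernoulli distribution,
we need to estimate the probability that $d \ge n^{1-2\beta-\epsilon}$.

Lemma 5.1. Under the assumptions of Theorem 3.4,
$\mathrm{Prob}(d \ge n^{1-2\beta-\epsilon}) \ge n^{-2\beta-o(1)}$ for any fixed
$\epsilon > 0$.

Proof. Since $d \le n$ we can use the Chebyshev
inequality as follows:

\[ \mathrm{Prob}(n - d \ge n - n^{1-2\beta-\epsilon}) \le \frac{\mathrm{Exp}(n - d)}{n - n^{1-2\beta-\epsilon}} \le \frac{n - n^{1-2\beta-o(1)}}{n - n^{1-2\beta-\epsilon}}. \]

Hence

\[ \mathrm{Prob}(d > n^{1-2\beta-\epsilon}) \ge 1 - \frac{n - n^{1-2\beta-o(1)}}{n - n^{1-2\beta-\epsilon}} = \frac{n^{1-2\beta-o(1)} - n^{1-2\beta-\epsilon}}{n - n^{1-2\beta-\epsilon}} = n^{-2\beta-o(1)}. \]
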

Theorem 5.1. Let $0 < \gamma$, $0 < \beta < 1/4$ and
$2\beta < \gamma$. Suppose that a group testing algorithm is
applied to $n^{\gamma}$ independent Bernoulli instances of P.
Then with probability at least $1 - e^{-n^{\gamma-2\beta-o(1)}}$, either
the average number of individual tests is greater than
$n^{1-4\beta-o(1)}$ or more than $\beta^2 \ln(n)\log_2(n)/\ln\ln(n)$
non-singleton pools are constructed.

Proof. Let $\epsilon > 0$ be arbitrarily small. Let E be the
event which consists of constructing a new non-singleton
pool, or individually testing at least $n^{1-2\beta-\epsilon/2}$ objects,
or already having $\ge \beta^2 \ln(n)\ln(n)/\ln\ln(n)$ pools. By
Theorem 1.2 and Lemma 5.1, for each instance of P
the probability of E is at least $n^{-2\beta-o(1)}$. Let $n(E)$
be the number of events E that occur in $n^{\gamma}$ instances.
We would like to relate the distribution of $n(E)$ to a
binomial distribution and apply tail estimates for the
binomial distribution. That this can be done follows
from the next lemma.

E. Knill

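Lemma 5.2. Let $E \subseteq \{1, \ldots, k\}$ be a random variable
which satisfies that for each U,
$\mathrm{Prob}(i \in E \mid E \cap \{1, \ldots, i-1\} = U) \ge q$. Let $B \subseteq \{1, \ldots, k\}$ be the
Bernoulli random variable with $\mathrm{Prob}(i \in B) = q$. Let
$\mathcal{F}$ be an upward closed family of subsets of $\{1, \ldots, k\}$
(i.e. $U \in \mathcal{F}$ and $V \supseteq U$ implies $V \in \mathcal{F}$). Then
$\mathrm{Prob}(E \in \mathcal{F}) \ge \mathrm{Prob}(B \in \mathcal{F})$.

Proof. We use induction on k. The lemma holds
trivially for $k = 1$. Let $\mathcal{F}_0 = \{U \mid U \in \mathcal{F} \wedge U \subseteq \{2, \ldots, k\}\}$
and $\mathcal{F}_1 = \{U \cap \{2, \ldots, k\} \mid U \in \mathcal{F}\}$. By
upward closure, $\mathcal{F}_0 \subseteq \mathcal{F}_1$.

\[ \mathrm{Prob}(E \in \mathcal{F}) = \mathrm{Prob}(1 \in E \wedge E \cap \{2, \ldots, k\} \in \mathcal{F}_1) + \mathrm{Prob}(1 \notin E \wedge E \in \mathcal{F}_0) \]
\[ = \mathrm{Prob}(1 \in E)\,\mathrm{Prob}(E \cap \{2, \ldots, k\} \in \mathcal{F}_1 \mid 1 \in E) + \mathrm{Prob}(1 \notin E)\,\mathrm{Prob}(E \in \mathcal{F}_0 \mid 1 \notin E) \]
\[ \ge \mathrm{Prob}(1 \in E)\,\mathrm{Prob}(B \cap \{2, \ldots, k\} \in \mathcal{F}_1) + \mathrm{Prob}(1 \notin E)\,\mathrm{Prob}(B \cap \{2, \ldots, k\} \in \mathcal{F}_0), \]

where the last step used the induction hypothesis twice
for $\{2, \ldots, k\}$ with $E' = E \cap \{2, \ldots, k\}$ conditioned on
$1 \in E$, and $E'' = E$ conditioned on $1 \notin E$. The result
follows by using the inequalities
$\mathrm{Prob}(B \cap \{2, \ldots, k\} \in \mathcal{F}_1) \ge \mathrm{Prob}(B \cap \{2, \ldots, k\} \in \mathcal{F}_0)$
and $\mathrm{Prob}(1 \in E) \ge q$.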
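
The domination statement of Lemma 5.2 can be illustrated by exact enumeration for a small k. In the sketch below the conditional law of E (each success raises the chance of later ones), the value q = 1/3, and the upward closed family "at least two elements" are hypothetical choices made for the example; `prob_in_family` is an illustrative helper, not something defined in the paper.

```python
from fractions import Fraction
from itertools import product


def prob_in_family(k, cond, family):
    """Exact Prob(E in family), where
    Prob(i in E | E intersect {1..i-1} = U) = cond(i, U)."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=k):
        U, weight = frozenset(), Fraction(1)
        for i, b in enumerate(bits, start=1):
            c = cond(i, U)
            weight *= c if b else 1 - c
            if b:
                U = U | {i}
        if family(U):
            total += weight
    return total


k, q = 4, Fraction(1, 3)
# Upward closed family: all subsets with at least two elements.
at_least_two = lambda U: len(U) >= 2
# Assumed dependent process: conditional probabilities never drop below q.
prob_E = prob_in_family(k, lambda i, U: q + Fraction(len(U), 10), at_least_two)
# Independent Bernoulli reference with Prob(i in B) = q.
prob_B = prob_in_family(k, lambda i, U: q, at_least_two)
print(prob_E >= prob_B, prob_B)
```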
Using Lemma 5.2 with $\mathcal{F} = \{U \mid |U| \ge n^{\gamma-2\beta-o(1)}\}$
allows us to apply the usual tail estimates on the
binomial distribution to obtain

\[ \mathrm{Prob}(n(E) < n^{\gamma-2\beta-o(1)}) \le e^{-n^{\gamma-2\beta-o(1)}}, \]

where constants have been absorbed into $n^{-o(1)}$. Suppose
that less than $n^{1-4\beta-\epsilon}$ individual tests are performed
on average and less than $\beta^2 \ln(n)\ln(n)/\ln\ln(n)$
non-singleton pools are constructed. Then

\[ n(E) \le \beta^2 \ln(n)\ln(n)/\ln\ln(n) + n^{\gamma-2\beta-\epsilon/2} \]

for the maximum number of times additional non-singleton
pools are constructed or at least $n^{1-2\beta-\epsilon/2}$
objects are tested. It follows that $1 - e^{-n^{\gamma-2\beta-o(1)}}$ is
a lower bound on the probability of the event in the
theorem.



6 Some problems 

Some of the interesting questions that are raised by this 
work include: 

Problem 6.1. Determine the precise nature of the 
threshold behavior of the tradeoff between v and d for 
two-stage algorithms. 

Problem 6.2. Given the information $I(P)$ of the
distribution of positives, what is the maximum gap
between the information theoretic lower bound and
the number of pools and individual queries required by
a two-stage algorithm?

Problem 6.3. Consider an arbitrary symmetric 
distribution of positives. Is it true that up to a mul- 
tiplicative constant the optimal two-stage algorithm is 
obtained by the probabilistic construction of Section 4? 

Problem 6.4. Can similar lower bounds be proved 
for approximation algorithms, that is algorithms which 
either determine P with high probability, or find at least 
$\min(p, |P|)$ positives with high probability?

Problem 6.4 is suggested by work described in [4]. Note
that if we are allowed to fail to determine P in $n^{-o(1)}$
instances, then the tradeoffs in Theorem 1.2 do not
apply.